R is a widely used free software environment for statistical computing and graphics. ggplot2 is a data visualization package for R that can be used to produce publication-quality graphics. This workshop is designed to introduce you to R and ggplot2 as well as RStudio, KnitR, Slidify, and Shiny.
R is a central piece of the Big Data Analytics Revolution. For example, see http://opensource.com/business/14/7/interview-david-smith-revolution-analytics for an article entitled “Big data influencer on how R is paving the way”.
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-apple-darwin13.3.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.4 evaluate_0.5.5 formatR_1.0 htmltools_0.2.4
## [5] knitr_1.6 rmarkdown_0.2.64 stringr_0.6.2 tools_3.1.1
## [9] yaml_2.1.13
You also need to install LaTeX if you want to generate PDF files from KnitR.
Phils-MacBook-Pro:Mine pcannata$ pwd
/Users/pcannata/Mine
Phils-MacBook-Pro:Mine pcannata$ git clone https://github.com/pcannata/RWorkshop.git
Cloning into 'RWorkshop'...
remote: Counting objects: 19, done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 19 (delta 3), reused 19 (delta 3)
Unpacking objects: 100% (19/19), done.
Checking connectivity... done
Phils-MacBook-Pro:Mine pcannata$ ls -a RWorkshop/
. .Rprofile.R 00 Doc 03 ggplot 05 KnitR 02
.. .git 01 Basic R 04 KnitR 01 RWorkshop.Rproj
Create a new file named .Rprofile by copying a file, see below.
Copy what’s in blue above into the new file named .Rprofile, as below.
See also http://cran.r-project.org/doc/manuals/r-devel/R-lang.html, http://www.r-tutor.com/r-introduction, and http://www.cookbook-r.com/
source("../01 Basic R/Basic.R", echo = TRUE)
##
## > "Variables"
## [1] "Variables"
##
## > v <- 211
##
## > v
## [1] 211
##
## > "Global Variables"
## [1] "Global Variables"
##
## > g <<- 234
##
## > g
## [1] 234
##
## > "Vectors"
## [1] "Vectors"
##
## > v1 <- c(1, 2, 3, 4, 5)
##
## > v1
## [1] 1 2 3 4 5
##
## > v2 <- 1:11
##
## > v2
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## > v3 <- -5:5
##
## > v3
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
##
## > "Vector Operations"
## [1] "Vector Operations"
##
## > v1
## [1] 1 2 3 4 5
##
## > v1 + 2
## [1] 3 4 5 6 7
##
## > v2
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## > sqrt(v2)
## [1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162 3.317
##
## > v2
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## > v3
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
##
## > v2 + v3
## [1] -4 -2 0 2 4 6 8 10 12 14 16
##
## > length(v3)
## [1] 11
##
## > mean(4:22)
## [1] 13
##
## > "Data Types: Numeric, Character, Dates, Logical(TRUE, FALSE)"
## [1] "Data Types: Numeric, Character, Dates, Logical(TRUE, FALSE)"
##
## > "Missing Data: NA"
## [1] "Missing Data: NA"
##
## > v <- c(1, 2, NA, 3)
##
## > v
## [1] 1 2 NA 3
##
## > "Missing Data: NULL"
## [1] "Missing Data: NULL"
##
## > v <- c(1, 2, NULL, 3)
##
## > v
## [1] 1 2 3
##
## > "Functions"
## [1] "Functions"
##
## > "Functions will be introduced in the section on ggplot below, however, let's have a look at the apropos() function:"
## [1] "Functions will be introduced in the section on ggplot below, however, let's have a look at the apropos() function:"
##
## > apropos("mean")
## [1] ".colMeans" ".rowMeans" "colMeans" "kmeans"
## [5] "mean" "mean.Date" "mean.default" "mean.difftime"
## [9] "mean.POSIXct" "mean.POSIXlt" "rowMeans" "weighted.mean"
##
## > "Data Structures: Dataframes, Lists, Matrices, and Arrays. Only Dataframes will be addressed in this workshop."
## [1] "Data Structures: Dataframes, Lists, Matrices, and Arrays. Only Dataframes will be addressed in this workshop."
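The difference between NA and NULL in the output above is easy to miss, so here is a minimal sketch that makes it explicit. The variable names v_na and v_null are illustrative and do not come from Basic.R.

```r
# NA marks a missing value but still occupies a slot in the vector.
v_na <- c(1, 2, NA, 3)
length(v_na)               # 4 elements, including the NA
is.na(v_na)                # TRUE only in the third position
mean(v_na)                 # NA, because one element is unknown
mean(v_na, na.rm = TRUE)   # mean of the remaining values 1, 2, and 3

# NULL is the empty object, so c() simply drops it.
v_null <- c(1, 2, NULL, 3)
length(v_null)             # 3 elements; nothing marks the gap
is.null(NULL)              # TRUE
```

Use NA when a position must be kept but its value is unknown. NULL is better thought of as "nothing here at all".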
A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b) # df is a data frame
head(df)
## n s b
## 1 2 aa TRUE
## 2 3 bb FALSE
## 3 5 cc TRUE
Dataframes can be loaded from databases, CSV files, Excel, etc. Loading dataframes from an Oracle database will be discussed later in this workshop.
See also http://www.r-tutor.com/r-introduction/data-frame
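As a rough illustration of loading a dataframe from a CSV file, the sketch below writes the small data frame from above to a temporary file and reads it back. The file path comes from tempfile() and the name df2 is just for this example. Nothing here is part of the workshop repository.

```r
# Build a small data frame like the one above.
df <- data.frame(
  n = c(2, 3, 5),
  s = c("aa", "bb", "cc"),
  b = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV file (path chosen by tempfile()).
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Read it back. stringsAsFactors = FALSE keeps s as character
# in older versions of R, where the default was TRUE.
df2 <- read.csv(csv_path, stringsAsFactors = FALSE)
str(df2)
```

read.csv() guesses column types from the file contents, so it is worth checking the result with str() before using it.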
Many R packages come with demo dataframes. The ggplot2 package comes with a demo dataframe called diamonds, which we will use for this workshop.
source("../02 R Dataframes/Dataframes.R", echo = TRUE)
##
## > library("ggplot2")
##
## > "Displaying the top few rows of a dataframe:"
## [1] "Displaying the top few rows of a dataframe:"
##
## > head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
##
## > "Selecting a subset of columns from a dataframe:"
## [1] "Selecting a subset of columns from a dataframe:"
##
## > head(subset(diamonds, select = c(carat, cut)))
## carat cut
## 1 0.23 Ideal
## 2 0.21 Premium
## 3 0.23 Good
## 4 0.29 Premium
## 5 0.31 Good
## 6 0.24 Very Good
##
## > "Selecting a subset of rows from a dataframe:"
## [1] "Selecting a subset of rows from a dataframe:"
##
## > head(subset(diamonds, cut == "Ideal" & price > 5000))
## carat cut color clarity depth table price x y z
## 11417 1.16 Ideal E SI2 62.7 56.0 5001 6.69 6.73 4.21
## 11418 1.16 Ideal E SI2 59.9 57.0 5001 6.80 6.82 4.08
## 11422 1.07 Ideal I SI1 61.7 56.1 5002 6.57 6.59 4.06
## 11423 1.10 Ideal H SI2 62.0 56.5 5002 6.58 6.63 4.09
## 11424 1.20 Ideal J SI1 62.1 55.0 5002 6.81 6.84 4.24
## 11431 1.14 Ideal H SI1 61.6 57.0 5003 6.70 6.75 4.14
##
## > "Find average price group by color (plyr package is needed)"
## [1] "Find average price group by color (plyr package is needed)"
##
## > library("plyr")
##
## > ddply(subset(diamonds, cut == "Ideal" & price > 5000),
## + ~color, summarise, o = mean(price, na.rm = TRUE))
## color o
## 1 D 9057
## 2 E 9065
## 3 F 9704
## 4 G 9392
## 5 H 8923
## 6 I 9663
## 7 J 9407
For more on subsetting dataframes see http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
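The same kind of row selection, column selection, and grouped average can also be written in base R without plyr. The sketch below is one possible way to do it, not the approach used in Dataframes.R. It uses a tiny made-up data frame called gems (hypothetical values, not taken from diamonds) so that it runs without any packages.

```r
# Toy data with hypothetical values.
gems <- data.frame(
  cut   = c("Ideal", "Ideal", "Good", "Ideal", "Premium"),
  color = c("E", "E", "J", "I", "E"),
  price = c(5200, 6000, 4800, 7000, 5100),
  stringsAsFactors = FALSE
)

# Rows: a logical condition before the comma.
# Columns: a vector of names after the comma.
ideal <- gems[gems$cut == "Ideal" & gems$price > 5000, c("color", "price")]

# Average price grouped by color, using base R's aggregate().
avg_by_color <- aggregate(price ~ color, data = ideal, FUN = mean)
avg_by_color
```

The formula price ~ color reads as "price grouped by color", which plays the same role as ~color in the ddply() call above.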
RJDBC is an R package for making database connections.
See also http://www.rforge.net/RJDBC/, and http://bommaritollc.com/2012/11/connecting-r-to-an-oracle-database-with-rjdbc/
source("../02A RJDBC/ConnectToOracle.R", echo = TRUE)
##
## > "\nPut the following into .bash_profile \nexport JAVA_HOME=`/usr/libexec/java_home` \n. ./.bash_profile \n\nDownload ojdbc6.jar into ~/Downloads \ns ..." ... [TRUNCATED]
## [1] "\nPut the following into .bash_profile \nexport JAVA_HOME=`/usr/libexec/java_home` \n. ./.bash_profile \n\nDownload ojdbc6.jar into ~/Downloads \nsudo mv ~/Downloads/ojdbc6.jar $JAVA_HOME \n"
##
## > Sys.setenv(JAVA_HOME = "/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home")
##
## > options(java.parameters = "-Xmx2g")
##
## > library(rJava)
##
## > .jinit()
##
## > print(.jcall("java/lang/System", "S", "getProperty",
## + "java.version"))
## [1] "1.7.0_60-ea"
##
## > library(RJDBC)
## Loading required package: DBI
##
## > jdbcDriver <- JDBC(driverClass = "oracle.jdbc.OracleDriver",
## + classPath = "~/ojdbc6.jar")
##
## > e1 <- 7369
##
## > e2 <- "SMITH"
##
## > e3 <- "CLERK"
##
## > e4 <- 7902
##
## > e5 <- "17-DEC-1980"
##
## > e6 <- 800
##
## > e7 <- 20
##
## > emps <- data.frame(e1, e2, e3, e4, e5, e6, e7)
##
## > col_headings <- c("EMPNO", "ENAME", "JOB", "MGR",
## + "HIREDATE", "SAL", "DEPTNO")
##
## > names(emps) <- col_headings
##
## > possibleError <- tryCatch(jdbcConnection <- dbConnect(jdbcDriver,
## + "jdbc:oracle:thin:@zenji.microlab.cs.utexas.edu:1521:orcl",
## + "C##cs34 ..." ... [TRUNCATED]
##
## > if (!inherits(possibleError, "error")) {
## + emps <- dbGetQuery(jdbcConnection, "select * from emp")
## + dbDisconnect(jdbcConnection)
## + }
## [1] TRUE
##
## > ggplot(data = emps) + geom_histogram(aes(x = SAL))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
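The script above builds a one-row emps data frame locally and only replaces it when the database connection succeeds. The sketch below isolates that tryCatch() pattern. It assumes the truncated dbConnect() call ends with an error handler that returns the condition object, which is not visible in the output. fake_connect() is a hypothetical stand-in for dbConnect() that always fails, so no database or Java setup is needed.

```r
# Hypothetical stand-in for dbConnect(); it always fails here.
fake_connect <- function() stop("database unreachable")

# Fallback data, built locally like emps in the script.
emps <- data.frame(EMPNO = 7369, ENAME = "SMITH", SAL = 800)

# tryCatch() returns whatever the error handler returns,
# here the condition object itself.
possibleError <- tryCatch(
  conn <- fake_connect(),
  error = function(e) e
)

# If the result is an error condition, keep the local fallback data.
used_fallback <- inherits(possibleError, "error")
if (used_fallback) {
  message("Connection failed: ", conditionMessage(possibleError))
}
nrow(emps)
```

With this pattern the histogram at the end of the script can still be drawn when the database is unreachable, just from the fallback row.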
ggplot2 is an R package for data exploration and visualization. It produces production-quality graphics and allows you to slice and dice your data in many different ways. ggplot2 uses a general scheme for data visualization that breaks graphs up into semantic components such as scales and layers. In contrast to other graphics packages, ggplot2 allows the user to add, remove, or alter components in a plot at a high level of abstraction.
See also http://ggplot2.org/, http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf, and https://groups.google.com/forum/#!forum/ggplot2
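To make the idea of semantic components concrete, here is a small sketch that builds one plot from separate pieces: data and aesthetic mappings, a layer, a scale, and labels. It assumes ggplot2 is installed and uses its diamonds dataframe. The particular choices (alpha, axis limits, title) are illustrative and are not taken from Plots.R.

```r
library(ggplot2)

# Start with data and aesthetic mappings only; no layer is drawn yet.
p <- ggplot(diamonds, aes(x = carat, y = price, colour = cut))

# Add a layer (points), then a scale, then labels.
p <- p +
  geom_point(alpha = 0.3) +
  scale_y_continuous(limits = c(5000, 15000)) +
  labs(title = "Diamonds", x = "Carat", y = "Price")

# Typing p at the console (or calling print(p)) draws the plot.
```

Each component can be swapped independently, for example replacing geom_point() with another geom while keeping the same data, mappings, and scale.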
source("../03 ggplot/Plots.R", echo = TRUE)
##
## > options(java.parameters = "-Xmx2g")
##
## > jdbcDriver <- JDBC(driverClass = "oracle.jdbc.OracleDriver",
## + classPath = "~/ojdbc7.jar")
##
## > con <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@128.83.138.158:1521:orcl",
## + "c##cs347_zi322", "orcl_zi322")
##
## > HCAHPSMeasure = dbGetQuery(con, "Select * From mc_HCAHPSMeasure")
##
## > InpatientServices <- dbGetQuery(con, "Select * from mc_InpatientServices")
##
## > Providers = dbGetQuery(con, "Select * from mc_Providers")
##
## > OutpatientServices <- dbGetQuery(con, "Select * from mc_OutpatientServices")
##
## > OutpatientVisits <- dbGetQuery(con, "select * from mc_OutpatientVisits_2 WHERE ID BETWEEN 0 and 30000")
##
## > OutpatientVisits = rbind(OutpatientVisits, dbGetQuery(con,
## + "select * from mc_OutpatientVisits_2 WHERE ID BETWEEN 30001 and 60000"))
##
## > OutpatientVisits = rbind(OutpatientVisits, dbGetQuery(con,
## + "select * from mc_OutpatientVisits_2 WHERE ID BETWEEN 60001 and 90000"))
##
## > OutpatientVisits = rbind(OutpatientVisits, dbGetQuery(con,
## + "select * from mc_OutpatientVisits_2 WHERE ID BETWEEN 90001 and 100000"))
##
## > InpatientVisits <- dbGetQuery(con, "select * from mc_InpatientVisits WHERE ID BETWEEN 0 and 30000")
##
## > InpatientVisits = rbind(InpatientVisits, dbGetQuery(con,
## + "select * from mc_InpatientVisits WHERE ID BETWEEN 30001 and 60000"))
##
## > InpatientVisits = rbind(InpatientVisits, dbGetQuery(con,
## + "select * from mc_InpatientVisits WHERE ID BETWEEN 60001 and 90000"))
##
## > InpatientVisits = rbind(InpatientVisits, dbGetQuery(con,
## + "select * from mc_InpatientVisits WHERE ID BETWEEN 90001 and 100000"))
##
## > InpatientVisits = rbind(InpatientVisits, dbGetQuery(con,
## + "select * from mc_InpatientVisits WHERE ID BETWEEN 100001 and 130000"))
##
## > InpatientVisits = rbind(InpatientVisits, dbGetQuery(con,
## + "select * from mc_InpatientVisits WHERE ID BETWEEN 130001 and 160000"))
##
## > head(InpatientVisits)
## ID DRGID PROVIDERID NUMDISCHARGEDPATIENTS COVEREDCHARGES TOTALPAYMENTS
## 1 666 39 340073 14 24347 5770.785714
## 2 667 39 340091 67 13557 6375.268657
## 3 668 39 340109 17 15891 7125.058824
## 4 669 39 340113 60 43530 8805.083333
## 5 670 39 340114 26 21445 6177.230769
## 6 671 39 340115 55 16703 6198.163636
## MEDICAREPAYMENT YEAR
## 1 4785 2012
## 2 5289 2012
## 3 6112 2012
## 4 6573 2012
## 5 4427 2012
## 6 5299 2012
##
## > outpatientCostByCity = dbGetQuery(con, "SELECT mc_Providers.City as City, AVG(mc_OutPatientVisits.AverageSubmittedCharges) as AvgBilledCost \nFROM m ..." ... [TRUNCATED]
##
## > outpatientCostByState = dbGetQuery(con, "SELECT mc_Providers.State as State, AVG(mc_OutPatientVisits.AverageSubmittedCharges) as AvgBilledCost \nFRO ..." ... [TRUNCATED]
##
## > outpatientCostByHospital = dbGetQuery(con, "\nSELECT mc_Providers.Name as Hospital, AVG(mc_OutPatientVisits.AverageSubmittedCharges) as AvgBilledCos ..." ... [TRUNCATED]
##
## > outpatientCostByCity = dbGetQuery(con, "SELECT mc_Providers.City as City, AVG(mc_OutPatientVisits.AverageSubmittedCharges) as AvgBilledCost \nFROM m ..." ... [TRUNCATED]
##
## > outpatientCostByState = dbGetQuery(con, "SELECT mc_Providers.State as State, AVG(mc_OutPatientVisits.AverageSubmittedCharges) as AvgBilledCost \nFRO ..." ... [TRUNCATED]
##
## > outpatientCostByHospital = dbGetQuery(con, "SELECT mc_Providers.Name as Hospital, AVG(mc_OutPatientVisits.AverageSubmittedCharges) as AvgBilledCost ..." ... [TRUNCATED]
##
## > InpatientCostByCity = dbGetQuery(con, "SELECT mc_Providers.City as City, AVG(mc_InPatientVisits.CoveredCharges) as AvgBilledCost \nFROM mc_InPatient ..." ... [TRUNCATED]
##
## > InpatientCostByState = dbGetQuery(con, "SELECT mc_Providers.State as State, AVG(mc_InpatientVisits.CoveredCharges) as AvgBilledCost \nFROM mc_Inpati ..." ... [TRUNCATED]
##
## > InpatientCostByHospital = dbGetQuery(con, "\nSELECT mc_Providers.Name as Hospital, AVG(mc_InPatientVisits.CoveredCharges) as AvgBilledCost \nFROM mc ..." ... [TRUNCATED]
##
## > p1 <- ggplot(InpatientCostByState, aes(x = STATE,
## + y = AVGBILLEDCOST)) + geom_point() + coord_flip()
##
## > p2 <- ggplot(outpatientCostByState, aes(x = STATE,
## + y = AVGBILLEDCOST)) + geom_point() + coord_flip()
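The script above pulls the large tables in ID ranges and grows the result with repeated rbind() calls. One possible way to express the same idea more compactly is sketched below. fetch_range() is a hypothetical stand-in for dbGetQuery(con, sql) that fabricates rows locally, and the chunk boundaries are tiny so the sketch runs without a database.

```r
# Hypothetical stand-in for dbGetQuery(con, sql):
# returns made-up rows for an inclusive ID range.
fetch_range <- function(lo, hi) {
  ids <- seq(lo, hi)
  data.frame(ID = ids, VALUE = ids * 2)
}

# Chunk boundaries; small numbers keep the sketch quick.
starts <- c(1L, 4L, 7L)
ends   <- c(3L, 6L, 8L)

# The SQL each chunk would send, built with sprintf().
queries <- sprintf(
  "select * from mc_InpatientVisits WHERE ID BETWEEN %d and %d",
  starts, ends
)

# Fetch every chunk, then bind them in one step
# instead of growing the data frame with repeated rbind().
chunks <- Map(fetch_range, starts, ends)
visits <- do.call(rbind, chunks)
nrow(visits)
```

In a real session, fetch_range() would be replaced by a function that passes each element of queries to dbGetQuery().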
Chapter 7 of “R for Everyone” has many more ggplot2 examples.
source("../03 ggplot/plotFunction.R", echo = TRUE)
##
## > FigureNum <<- 0
##
## > ggplot_func <- function(df, Title = "Diamonds", Legend = "color",
## + PointColor = c("red", "blue", "green", "yellow", "grey",
## + "black" .... [TRUNCATED]
##
## > p1 <- ggplot_func(diamonds)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
##
## > p1
##
## > p2 <- ggplot_func(diamonds, YMin = 5000, YMax = 15000)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
##
## > p2
## Warning: Removed 40868 rows containing missing values (geom_point).
##
## > p3 <- ggplot_func(subset(diamonds, cut == "Premium"),
## + Legend = "cut")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
##
## > p3
##
## > p4 <- ggplot_func(diamonds, Legend = "clarity", PointColor = c("red",
## + "blue", "green", "yellow", "grey", "black", "purple", "orange"))
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
##
## > p4
##
## > library("grid", lib.loc = "/Library/Frameworks/R.framework/Versions/3.0/Resources/library")
##
## > png("4diamonds.png", width = 25, height = 20, units = "in",
## + res = 72)
##
## > grid.newpage()
##
## > pushViewport(viewport(layout = grid.layout(2, 2)))
##
## > print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
##
## > print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
## Warning: Removed 40868 rows containing missing values (geom_point).
##
## > print(p3, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
##
## > print(p4, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))
##
## > dev.off()
## pdf
## 2
You should now be able to open RWorkshop/00 Doc/4diamonds.png. It should look like the following plot.
KnitR is an R package designed to generate dynamic reports using a mix of R, LaTeX, and R Markdown (see http://rmarkdown.rstudio.com/?version=0.98.945&mode=desktop).
See also http://yihui.name/knitr/ and http://kbroman.github.io/knitr_knutshell/
Simple examples can be found in “04 KnitR/doc1.Rmd” and “04 KnitR/doc2.Rmd”. These can generate HTML, PDF, and Word documents. The output from knitting doc2.Rmd is:
You can use Slidify to generate HTML slide decks using only the R Markdown language.
See also http://slidify.org and http://slidify.org/start.html
Follow the instructions in “05 Slidify/slidify setup.R” to install and run slidify. You should be able to produce a slide deck with a first slide that looks something like the following.
Cool trick: any GitHub repo with a branch called gh-pages will get served as a website. If the content of that repo is the stuff of websites (HTML, CSS), then you get free web hosting. So, create a branch called gh-pages and push to it.
The shiny R package allows you to build interactive web-based applications using only R, with no knowledge of HTML, CSS, or JavaScript needed. You just need to write two scripts, conventionally named ui.R and server.R (see the example files in the 06Shiny directory).
See also http://shiny.rstudio.com and http://shiny.rstudio.com/tutorial
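As a rough idea of what the two parts of a shiny app contain, here is a minimal sketch. It is not the app from the 06Shiny directory. The input name n, the output name scatter, and the random scatter plot are all made up for illustration, and the sketch assumes the shiny package is installed. The user interface and the server logic are shown as two objects in one file. In a two-script app, the first would go in ui.R and the second in server.R.

```r
library(shiny)

# User interface: describes the page layout
# (hypothetical slider input and plot output).
ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

# Server logic: computes outputs from inputs.
# The plot is redrawn whenever input$n changes.
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n))
  })
}

# In a single file, the two pieces can be combined and launched with:
# shinyApp(ui = ui, server = server)
```

The launch line is left commented out because it starts a web server and blocks the R session until the app is stopped.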
To run the shiny app that’s in the 06Shiny directory, run the following in the main RWorkshop directory (make sure the working directory is set to this directory):
library(shiny)
runApp("06Shiny") # Make sure there are no spaces in the string argument to runApp
This should pop the application up in a browser. You can also access it in a browser at http://127.0.0.1:6837. It should look like the following.
The example above ran the shiny app on your local machine. To share it with others that way, you have to send around the R files, and each user needs to have R installed and know a little bit about it.
Instead, you can remotely host shiny apps and then just send people links. Get a free account at shinyapps.io/signup.html and give it a try.
library("devtools", lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library")
install_github(repo = "shinyapps", username = "rstudio")
shinyapps::setAccountInfo(name='pcannata', token='3ECF447A741004F6A8B7208C9ED778E1', secret='. . .')
library(shinyapps)
getwd()
## [1] "/Users/alyshialedlie/RWorkshop/00 Doc"
# Uncomment the following line to deploy the app.
#deployApp("../06Shiny")
Now you can try the app at https://pcannata.shinyapps.io/06Shiny/
See also https://www.shinyapps.io/ and http://shiny.rstudio.com/articles/shinyapps.html